Code repository intrusion detection

ABSTRACT

The disclosed subject matter provides for code repository intrusion detection. A code developer profile can be generated based on characteristic features present in code composed by the developer. Characteristic features can be related to the coding propensities peculiar to individual developers and, over sufficient numbers of characteristic features, can be considered pseudo-signatures. A target code set is analyzed in view of one or more developer profiles to generate a validation score related to a likelihood of a particular developer composing a portion of the target code set. This can serve to confirm or refute a claim of authorship, or can serve to identify likely author candidates from a set of developers. Where the target code set authorship is determined to be sufficiently suspect, the code set can be subjected to further scrutiny to thwart intrusion into the code repository.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Patent Application No. 61/661,617, entitled “SOURCE ANALYSIS AND INTRUSION DETECTION FOR COMPUTING SYSTEMS,” filed on Jun. 19, 2012, which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

The disclosed subject matter relates to characterizing features of information to be stored electronically and, more particularly, to analysis of features to determine confidence in an identity associated with information to be stored electronically.

BACKGROUND

By way of brief background, information can be stored electronically in data stores. As an example, one common use of a data store is to store computer programming code or source code electronically. These exemplary data stores for computer programming code can facilitate code developers' access to the stored source code. As such, a developer can typically access source code, update the source code, and store the updated source code in the data store. In an aspect, the developer can be local to the data store, although it is common for the developer to be located remote from the data store. Remote access to the data store for interacting with computer programming code can facilitate a geographically diverse set of code developers. Further, such exemplary data stores can typically facilitate interaction with computer programming code by a plurality of code developers among other users. In some embodiments, source code databases can be quite large, even spanning multiple data stores, and can support interaction with many thousands of code developers and other users.

Code repositories, e.g., data stores with computer programming code stored thereon, can be significant corporate or governmental investments and can include valuable source code or information. This valuable source code or information can, for example, include proprietary source code or information, include code for significant products or families of products such as flagship software products, include code for sensitive systems or operations such as security/military systems or patient records management systems, etc. Security for access to the computer programming code can be detailed, complex, and highly evolved. However, accessing code repositories despite security systems can occur and source code can be committed or checked-in to a code base stored on the data stores in a manner that can be undesirable. Conventional mechanisms to address these types of situations generally involve significant manual review of committed code patches by review personnel. These mechanisms can be expensive, inconvenient, tedious, and slow.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is an illustration of a system that facilitates code feature analysis in accordance with aspects of the subject disclosure.

FIG. 2 is a depiction of a system that facilitates code feature analysis and profile development in accordance with aspects of the subject disclosure.

FIG. 3 illustrates a system that facilitates code feature analysis training in accordance with aspects of the subject disclosure.

FIG. 4 illustrates a system that facilitates code feature analysis training and development of a training data set in accordance with aspects of the subject disclosure.

FIG. 5 is a depiction of a system that facilitates code feature analysis training with a trusted coder data set in accordance with aspects of the subject disclosure.

FIG. 6 is a depiction of a system that facilitates code feature analysis with a weighted profile in accordance with aspects of the subject disclosure.

FIG. 7 illustrates a method for facilitating code feature analysis in accordance with aspects of the subject disclosure.

FIG. 8 illustrates a method for facilitating code feature analysis profile development in accordance with aspects of the subject disclosure.

FIG. 9 depicts a schematic block diagram of a sample-computing environment with which the claimed subject matter can interact.

FIG. 10 illustrates a block diagram of a computing system operable to execute the disclosed systems and methods in accordance with an embodiment.

DETAILED DESCRIPTION

The subject disclosure is now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the subject disclosure. It may be evident, however, that the subject disclosure may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the subject disclosure.

Computer programming code can be stored in code repositories residing on data storage components. Computer programming code, used herein interchangeably with the term ‘source code’ unless explicitly stated otherwise, is meant to include any description of a software system and can be construed to include machine code, very high level languages or executable graphical representations of systems. The computer programming code can generally be accessed by developers to facilitate development of the code stored in the code repository. Computer programming code can represent a significant investment in terms of development and intellectual property to the owners of the code. Code can also be associated with sensitive information or security/military concerns. As an example, a code repository housing source code for operation of nuclear reactors can be considered a highly sensitive code repository. It can be desirable to include safeguards to protect code repositories. Insertion of malicious code into a code base can represent a real threat to the owner of the code. As an example, malicious code can include code allowing unauthorized entities to access associated computer systems.

Generally, only trusted developers are allowed access to sensitive code or code repositories. However, as malicious programmers become more sophisticated, in some cases even feasibly being supported by competing corporations or governments, the ability to commit malicious code surreptitiously to a code repository by bypassing traditional security measures or masquerading as a trusted developer is becoming more common. As an example, a competing company can approach a trusted developer, who could be bribed to submit a malicious code patch into a code repository that, if incorporated into a built product, could allow the competitor to access the product, cause problems with the deployed product, etc., to give an advantage to the competitor.

Detecting such intrusions into a code repository can be highly challenging for conventional systems. One common approach can include hiring a computer forensics team to reconstruct the revision process of source code to try to locate when malicious code was inserted. One problem with this approach is that, especially with large or complex source code, this can be extraordinarily expensive in both time and money. Further, due to the expenses, this approach can often only be undertaken after a problem has been detected in the code or deployed product. Another approach can include spot-checking of code submissions, such as having code patches randomly reviewed by other developers to try to catch malicious code before it is incorporated into a build of a product. This approach can obviously miss malicious code where a skilled developer can game the system to avoid the review or where there simply are not enough review sessions to catch a statistically significant amount of malicious code submissions.

Often, literary authors are associated with a style of writing, some so strongly that even small segments of their work can be easily recognized simply by the style of their compositions. This same personal style is also frequently found for musical composers or artists. Given a large enough number of characteristics, an author, composer, or artist could be associated with a profile that could reasonably be employed to classify works with the author, composer, or artist. Similarly, as disclosed herein, authors of computer programming code, e.g., developers, can also be associated with code characteristics. As an example, a developer can frequently link a specific set of libraries, employ a particular keystroke sequence for starting a programmatic comment, capitalize variable names in a predictable manner, repeatedly employ certain function calls, submit (or, in the inverse, not submit) code at a predictable time of day or on specific days, etc., and the particular combination and weighting of these characteristics can be associated with that developer's code. As greater numbers of characteristics or characteristics that are more unusual are included, the association of those characteristics with a piece of that developer's code can act as a signature of sorts.

As disclosed herein, machine learning, statistical inference, normalization, Bayesian filtering, or other mechanisms can be employed in analysis of a developer's coding characteristics to determine a profile for that developer. A developer profile can then be employed in analysis of a target set of code to determine a likelihood that that developer composed the target code. An advantage of applying these techniques is that potentially huge numbers of characteristics can be analyzed and weighted in developing profiles of developers. Further, where large sets of code for a developer are available over time, these analysis mechanisms can also allow a profile to identify changes in a developer's maturing coding style such that analysis of a target code set could also be informative as to when a piece of code was composed by the developer, in much the same manner as Picasso's blue period can be distinct from his rose period or cubist period, although each period can still be termed strongly ‘Picasso’.

Where developer profiles can be determined, code check-ins can be analyzed in view of the profile of the purported developer to facilitate determinations of code repository intrusion. As an example, where a developer always includes the following numerical sequence, “112358132134”, somewhere in every piece of code she has drafted in the last 15 years, it can be unlikely that a piece of code she submitted to a code repository last week was composed by her where “112358132134” is not present in the submitted code. As such, where the exemplary numerical sequence is missing, the code can be subject to additional scrutiny.

The following presents other simplified example embodiments of the disclosed subject matter in order to provide a basic understanding of some aspects of the various embodiments. This is not an extensive overview of the various embodiments. It is intended neither to identify key or critical elements of the various embodiments nor to delineate the scope of the various embodiments. Its sole purpose is to present some concepts of the disclosure in a streamlined form as a prelude to the more detailed description that is presented later.

In an embodiment, a system can include a processor and memory. The processor can facilitate the execution of computer-executable instructions stored on the memory. The execution of the computer-executable instructions can cause the processor to receive a code file set and to identify a characteristic feature associated with the code file set, a code file of the code file set, or a computer instruction of a code file of the code file set, wherein one or more code files of the code file set comprise source code. The processor can further determine a feature value related to the characteristic feature and facilitate access to the feature value.

In another embodiment, a method can include receiving, by a system including a processor, a code file set including one or more code files comprising source code. The method can further include the processor identifying a characteristic feature present in at least a portion of the code file set and determining a feature score for the characteristic feature. The method can further include the processor facilitating access to the feature score.

In a further embodiment, a device can include a memory storing computer-executable instructions and a processor that facilitates execution of the computer-executable instructions. These instructions can cause the processor to receive a target code file set. One or more code files of the target code file set can comprise source code. The instructions can further cause the processor to identify a characteristic feature associated with the target code file set, a code file of the code file set, or a computer instruction of a code file of the code file set and determine a feature value related to the characteristic feature. The processor can further receive a developer profile comprising a set of characteristic feature values associated with the presence of characteristic features correlated to historical code files associated with the developer profile. Based on a determined level of confidence that the target code set is authored by a developer associated with the developer profile, the processor can determine an intrusion score.

To the accomplishment of the foregoing and related ends, the disclosed subject matter, then, comprises one or more of the features hereinafter more fully described. The following description and the annexed drawings set forth in detail certain illustrative aspects of the subject matter. However, these aspects are indicative of but a few of the various ways in which the principles of the subject matter can be employed. Other aspects, advantages and novel features of the disclosed subject matter will become apparent from the following detailed description when considered in conjunction with the provided drawings.

FIG. 1 is an illustration of a system 100, which facilitates code feature analysis in accordance with aspects of the subject disclosure. System 100 can include code repository component (CRC) 102. CRC 102 can comprise a memory, data store, or other data storage component, where code can be stored to facilitate developer access. System 100 can further include code repository access component (CRAC) 104 that can facilitate access to CRC 102. As an example, CRAC 104 can be a web interface for CRC 102 allowing developers to check code in and out of the code repository, e.g., CRC 102. CRAC 104 can be communicatively coupled to CRC 102 by way of code repository intrusion detection component (CRID) 110. In an aspect, CRID 110 can be inserted into the communicative coupling between CRAC 104 and CRC 102 to facilitate aspects of the presently disclosed subject matter, such as training CRID 110 developer profiles, intrusion detection, etc.

CRID 110 can facilitate determining intrusion score information 114 for code sets. A code set can include one or more lines of code. Code in a code set can be related, such as where the lines of code comprising the code set are for a single project, program, function, etc. In an aspect, the code set can include groups of code that are not directly related, for example, where a code set includes code snippets, i.e., a small region of code generally smaller than a code base, for more than one project housed in a code repository. A code base of a programming project can be the larger collection of source code of the computer programs that comprise the project. In another aspect, the code set can be an empty set having no lines of code.

In an aspect, the determination of intrusion score information 114 can be based on a comparison of a developer profile to a code set. Intrusion score information 114 can be a metric for identifying possible submission of code to a code repository, e.g., CRC 102 by way of CRAC 104, which may not be composed by a designated developer. As an example, intrusion score information 114 can include a value designating the likelihood, as a percentage value, that an identified developer composed a target code set. As such, a lower percentage can be associated with an increased likelihood of subsequent review of the target code set. Where a malicious programmer masquerades as a trusted developer to submit code to a repository, analysis of the submitted code can indicate that it includes characteristics that sufficiently depart from the trusted developer's profile so as to subject the submitted code to further review. Where the submitted code is determined to be composed by someone other than the trusted developer, further action can be undertaken, such as updating the username/password of the trusted developer, legal action, further forensic analysis, countermeasures, etc. In an aspect, where an alert or report is generated in response to a potential intrusion, this alert or report can be received at nearly any target location, target system, target device, target method, etc. As examples, an alert can be received on a mobile device, on a logging system, at a method for quarantining suspect code, on a mobile device thousands of miles away from the code repository, at a thin client in the code repository facility, etc., without departing from the scope of the disclosed subject matter.
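As a non-limiting illustration, the mapping from a percentage likelihood to a decision on subsequent review can be sketched in Python as follows. The function name `triage`, the labels it returns, and the 50% default threshold are hypothetical choices made only for this example and are not specified by the disclosure.

```python
def triage(likelihood_pct: float, review_below_pct: float = 50.0) -> str:
    """Map an authorship likelihood (0 to 100) to a handling decision.

    A lower likelihood that the purported developer composed the code
    makes further review more appropriate. The threshold is an assumed,
    tunable value.
    """
    if not 0.0 <= likelihood_pct <= 100.0:
        raise ValueError("likelihood must be a percentage between 0 and 100")
    return "review" if likelihood_pct < review_below_pct else "accept"
```

In practice the threshold could be set per repository or per developer, and a "review" decision could trigger any of the alerts or reports described above.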

The disclosed subject matter can mine information from code repositories, including source code management systems, revision control systems and/or versioning systems, such as Concurrent Versions System (CVS), Apache Subversion (SVN), Git, etc. It will be appreciated that code repositories include nearly any set of stored code. These sets of stored code can be in a large file repository, such as a corporate code repository, but can also be stored on nearly any other storage medium regardless of size or complexity, such as a thumb drive, hard drive, optical drive, cloud storage, RAM, ROM, EEPROM, etc., or nearly any other memory, as disclosed elsewhere herein. The mined information can then be scaled, normalized, trained, or scored using machine learning algorithms to produce a profile including a weighted set of features. Profiles can then be used to make predictions on future code set check-ins to a code repository. The disclosed approach can abstract information in a code repository to profile one or more developers of code in the repository. The abstracted information can be persisted to a data store of developer profiles. The profiles can be employed to train intrusion detection algorithms based on developer history, metrics, propensities, or nearly any other available information. Developer profiles can be updated and kept current. Similarly, intrusion detection can also be kept current, allowing for reanalysis of code sets with updated profiles.

In an aspect, machine learning can include the use of logistic regression techniques. Logistic regression is generally simple to tune, fast, and simple to implement. It will be noted that the subject disclosure is not limited to logistic regression and that any other form of statistical analysis can be employed in conjunction with machine learning without departing from the scope of the presently disclosed subject matter. In an aspect, regularization, which helps prevent overfitting (i.e., describing random error or noise instead of the underlying relationship), can be combined with machine learning techniques, e.g., logistic regression, to develop a more generalized curve fit.
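As one possible formulation, offered only as an illustration, a regularized logistic regression objective can be written as below. The symbols are assumptions introduced for this example: x is a vector of characteristic feature values for a code set, y is 1 where the code set is known to be composed by the developer and 0 otherwise, w is the vector of feature weights, m is the number of training code sets, n is the number of features, and λ is the regularization strength. An L2 penalty is shown; the disclosure does not specify a particular form of regularization.

```latex
h_w(x) = \frac{1}{1 + e^{-w^{\top} x}}, \qquad
J(w) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_w\big(x^{(i)}\big)
       + \big(1 - y^{(i)}\big) \log\big(1 - h_w\big(x^{(i)}\big)\big) \Big]
       + \frac{\lambda}{2} \sum_{j=1}^{n} w_j^{2}
```

The first term rewards weights that fit the training code sets, while the penalty term discourages large weights that would fit noise rather than the underlying relationship.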

In another aspect, gradient descent can be employed in a training phase to develop localized minima of the characteristics of a developer profile. Gradient descent is an iterative algorithm that can work well to determine minimal weight values for characteristics to facilitate classification of distinct developers. In an aspect, gradient descent works in spaces of any number of dimensions, even in nearly infinite-dimensional ones, i.e., very large numbers of dimensions, and can thus be readily employed in optimizations of large numbers of characteristic features. It will be noted that other optimization algorithms can be employed without departing from the scope of the presently disclosed subject matter.
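A minimal sketch of such a training phase, assuming batch gradient descent applied to logistic regression with an L2 penalty, is given below in Python. The function names, the learning rate, the penalty strength, and the step count are hypothetical values chosen for illustration; a practical system would likely rely on an established machine learning library and on scaled or normalized features.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, learning_rate=0.5, l2=0.01, steps=2000):
    """Fit feature weights by batch gradient descent.

    features: list of equal-length lists of characteristic feature values.
    labels:   1 if the code set was composed by the developer, else 0.
    l2:       assumed L2 regularization strength applied to the weights.
    """
    count = len(features)
    width = len(features[0])
    weights = [0.0] * width
    bias = 0.0
    for _ in range(steps):
        grad_w = [0.0] * width
        grad_b = 0.0
        for x, y in zip(features, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j in range(width):
                grad_w[j] += error * x[j]
            grad_b += error
        # Step against the averaged gradient; the penalty shrinks weights.
        for j in range(width):
            weights[j] -= learning_rate * (grad_w[j] / count + l2 * weights[j])
        bias -= learning_rate * grad_b / count
    return weights, bias


def predict(weights, bias, x):
    """Estimated likelihood (0 to 1) that the developer composed the code."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
```

For example, `train` could be given rows of two feature values per code set, such as a comment ratio and a share of braces placed on their own line, and `predict` could then be applied to the same two values measured on a newly submitted code set.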

In a further aspect, a support vector machine (SVM) can be employed for developing a profile. The ability of SVM to group characteristics into multiple groups can allow SVM to be an efficient technique when applied to the types of characteristics associated with classifying a code set as composed by a developer or not. In an aspect, SVM does typically include more complexity in debugging, implementation, and additional computational time for training as compared to some regression techniques.

A code characteristic can be any data, fact, or knowledge mined or inferred from source code, generated machine code, repository, logs, or metadata. As such, code characteristics can include spelling variable expansion wherein the spelling consistency of a developer can be analyzed. System 100 can parse words and variable names for a code set for frequency of usage, casing, and other commonalities. Similar to the spelling variable expansion, syntactical keyword expansion can parse coding keywords and variable names to analyze for frequency of usage, casing, and other commonalities.
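By way of a hedged example, parsing words and variable names for frequency of usage and casing can be sketched in Python as follows. The regular expression, the casing categories, and the function names are assumptions made for this illustration; the sketch does not distinguish language keywords from variable names, which a fuller implementation would likely do with a proper lexer.

```python
import re
from collections import Counter

# Assumed definition of an identifier-like token: a letter or underscore
# followed by letters, digits, or underscores.
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def casing_style(name):
    """Classify a token into one of four illustrative casing categories."""
    if "_" in name.strip("_"):
        return "underscore"
    if name.isupper():
        return "upper"
    if name.islower():
        return "lower"
    return "mixed"


def identifier_features(source):
    """Return (token frequencies, casing style frequencies) for a code set."""
    names = IDENTIFIER.findall(source)
    return Counter(names), Counter(casing_style(n) for n in names)
```

The two counters could then be normalized and stored as characteristic feature values in a developer profile.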

Code characteristics can further include the byte size of a comment or message associated with code check-in. This can evaluate, for example, how wordy or not wordy a developer is. Another code characteristic can be the size of the change made to the code between code submissions (e.g., the ‘diff’ per check-in) to evaluate how large a change a developer tends to commit. A further code characteristic can be a day of the week or time of day the developer has a propensity for checking in code or not checking in code, such as the developer checks in code 85% of the time between 3 am and 5 am on Fridays or has a 0% rate for checking in code on national holidays.
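The check-in characteristics above can be illustrated with the following Python sketch. The input shape, a list of (timestamp, commit message, changed line count) tuples, and all names are hypothetical; how such records are obtained from a particular revision control system is outside the scope of the example.

```python
from collections import Counter
from datetime import datetime

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def checkin_features(checkins):
    """Summarize check-in habits.

    checkins: list of (datetime, commit_message, changed_line_count).
    Returns the mean message size in bytes, the mean size of the change,
    and the share of check-ins falling on each day of the week.
    """
    total = len(checkins)
    if total == 0:
        raise ValueError("at least one check-in is required")
    by_day = Counter(DAYS[ts.weekday()] for ts, _, _ in checkins)
    return {
        "mean_message_bytes": sum(len(msg.encode("utf-8")) for _, msg, _ in checkins) / total,
        "mean_changed_lines": sum(lines for _, _, lines in checkins) / total,
        "day_share": {day: by_day[day] / total for day in DAYS},
    }
```

A time-of-day share could be computed in the same way by bucketing `ts.hour` instead of `ts.weekday()`.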

Another code characteristic can be brace placement to evaluate a developer's style for placement of a brace, e.g., a closing brace on the same line or the next line. Yet another code characteristic can be regular use of uniform spacing near a keyword.
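One simple way to measure a brace placement habit is sketched below in Python. As an assumption for this example, the sketch looks at opening braces and asks whether each sits alone on its own line or trails other code on the same line; an analogous count could be made for closing braces. The function name and return shape are hypothetical.

```python
def brace_placement(source):
    """Count opening braces on their own line versus trailing a statement."""
    own_line = 0
    trailing = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == "{":
            own_line += 1
        elif stripped.endswith("{"):
            trailing += 1
    total = own_line + trailing
    return {
        "own_line": own_line,
        "trailing": trailing,
        # None signals that no opening braces were found to measure.
        "own_line_share": own_line / total if total else None,
    }
```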

Code characteristics can also be associated with characteristics such as a check-in temporal distance to evaluate, for example, the mean time between two check-ins performed by a developer on the same source code. The propensity of a developer to check in code daily, weekly, monthly, etc., can be analyzed.
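The check-in temporal distance can be computed, for instance, as the mean gap between consecutive check-in timestamps. The Python sketch below assumes the timestamps for one developer and one source file have already been collected; the function name is hypothetical.

```python
from datetime import datetime, timedelta


def mean_checkin_gap(timestamps):
    """Return the mean time between consecutive check-ins, or None.

    None is returned when fewer than two check-ins are available,
    since no gap can be measured.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)
```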

The spelling performance of a developer can also be evaluated by performing spell check on a commit message. Spelling errors can be employed as characteristic features associated with the developer. Similarly, spelling within the code set itself can also be employed as a characteristic feature. Likewise, a developer's usage of camel casing can be tracked. Camel case is the use of capitalization in positions other than the first letter of a word, e.g., “thESe aRe caMel cASe wOrds”.
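Following the definition of camel case given above, tracking its usage can be sketched in Python as shown below. The function names are hypothetical, and the sketch treats any capital after the first letter as camel casing, so fully capitalized words such as constants are also counted; a fuller implementation might exclude them.

```python
import re

WORD = re.compile(r"[A-Za-z]+")


def has_inner_capital(word):
    """True if any letter other than the first is capitalized."""
    return any(ch.isupper() for ch in word[1:])


def camel_case_share(text):
    """Fraction of alphabetic words in the text that use camel casing."""
    words = WORD.findall(text)
    if not words:
        return 0.0
    return sum(has_inner_capital(w) for w in words) / len(words)
```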

Another code characteristic can be the number of comments a developer commonly uses compared to the total number of lines in the code check-in. This can be determined, for example, by searching for ‘//’, ‘/*’, and ‘*/’ and comparing the count to the total number of lines in the check-in.
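The search described above can be sketched in Python as follows. The function name is hypothetical, and the sketch counts raw occurrences of the three markers without parsing the code, so markers that appear inside string literals (for example, the ‘//’ in a URL) would also be counted.

```python
def comment_marker_ratio(source):
    """Ratio of comment markers ('//', '/*', '*/') to total lines."""
    lines = source.splitlines()
    if not lines:
        return 0.0
    markers = sum(
        line.count("//") + line.count("/*") + line.count("*/")
        for line in lines
    )
    return markers / len(lines)
```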

Other code characteristics can include coding styles and patterns, typing notation, underscore as spacing usage or other spacing styles, implicit usage of keywords (e.g., else, default, return, continue, next, etc.) when not required, coding of checking (e.g., variable, null, or Boolean checking (!checked) vs. (checked != null)), ternary operator usage, variable placement and truncation, “constant identifier” correctness, or other work patterns. Additionally, lexical analysis, including parsers and lattice analysis to develop an overall understanding of the machine patterns of a code set, can generate characteristic features that can be employed in profile development. It will be noted that nearly any characteristic feature of a code set, e.g., nearly any data, fact, or knowledge mined or inferred, can be employed in developing a profile for an associated developer without departing from the scope of the present disclosure.

In another aspect, malicious code signature detection can increase the accuracy of detecting intrusion into a code repository. Malicious code signatures can include code used by a developer in contravention to best programming practices. In an embodiment, low levels of malicious code signatures can typically be associated with trusted developers and, as such, higher levels of malicious code signatures can be indicative of an intruder into a code repository. Examples of malicious code signatures can include: banned API usage (e.g., strcpy); potentially hostile library calls (e.g., system()); potentially hostile keyword usage (e.g., asm); potentially hostile variable assignment (e.g., uid=0); dangerous patterns; etc.
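A hedged sketch of counting such signatures in a code set is given below in Python. The regular expressions cover only the four examples named above, and the category names and function name are assumptions made for illustration; pattern matching of this kind does not parse the code and can produce false positives, e.g., for matches inside comments.

```python
import re

# One illustrative pattern per example given in the text.
SIGNATURES = {
    "banned_api": re.compile(r"\bstrcpy\s*\("),
    "hostile_library_call": re.compile(r"\bsystem\s*\("),
    "hostile_keyword": re.compile(r"\basm\b"),
    # Matches an assignment such as uid=0 but not the comparison uid == 0.
    "hostile_assignment": re.compile(r"\buid\s*=\s*0\b"),
}


def signature_counts(source):
    """Count occurrences of each illustrative signature in a code set."""
    return {name: len(pattern.findall(source)) for name, pattern in SIGNATURES.items()}
```

The resulting counts could be compared against the low levels typically associated with a trusted developer's profile.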

FIG. 2 is a depiction of a system 200 that can facilitate code feature analysis and profile development in accordance with aspects of the subject disclosure. System 200 can include CRC 202 and CRAC 204 communicatively coupled to CRID 210. CRC 202 can comprise a memory, data store, or other data storage component, where code can be stored to facilitate developer access. CRAC 204 can facilitate access to CRC 202. In an aspect, CRID 210 can be inserted into the communicative coupling between CRAC 204 and CRC 202 to facilitate aspects of the presently disclosed subject matter, such as training CRID 210 developer profiles, intrusion detection, etc.

CRID 210 can further include profile store component 212 communicatively coupled to training component 220 and detecting component 230. Profile store component 212 can store a profile of a developer as disclosed herein. In a further aspect, profile store component 212 can also store patterns related to nearly any level of code developer granularity, such as, for a coding project, for a set of code developers (e.g., a developer team, product division, etc.), for some or all code developers of one or more relevant business entities (e.g., all code developers of a corporation, a set of code developers from three cooperating companies, etc.), etc. While these patterns are herein generally affiliated with “a developer” for clarity and brevity, the subject disclosure is not so strictly limited, and “a developer” can be read to include nearly any granularity of code developers, as disclosed, unless explicitly or inherently limited to a single developer. It further will be noted that patterns can include both affirmative patterns, e.g., patterns derived from the presence of indicia, and negative patterns, e.g., patterns derived from the absence of indicia. Training component 220 can develop a profile based on code sets associated with a developer. Training component 220 can store a profile on profile store component 212. Detecting component 230 can access a profile stored on profile store component 212 to facilitate determining intrusion score information 214. The determination of intrusion score information 214 can be based on a comparison of a developer profile to a code set. Intrusion score information 214 can be a metric for identifying possible submission of code to a code repository.

In an aspect, training, profile access, and determination of possible intrusion into a code repository can be included in CRID 210. This can facilitate real-time updates to developer profiles as each developer submits code sets for a code repository. In an embodiment, the colocation of training, profile access, and intrusion determination can be embodied in a “black box” CRID 210 component that can be inserted in front of CRC 202 in a relatively seamless manner. This can facilitate the easy installation of an intrusion detection system into conventional code repository systems. In other embodiments, training component 220, profile store component 212, and detecting component 230 can be located other than internal to CRID 210. As an example, profile store component 212 can be included in CRC 202 rather than in CRID 210.

In another aspect, CRID 210 can interface with a source code editor. In an embodiment, CRID 210 can be communicatively coupled to a source code editor. In a further embodiment, CRID 210 can be integrated into a source code editor. A source code editor can facilitate editing source code and can include, for example, a text editor, an integrated development environment (IDE), etc. Where CRID 210 is interfaced to an editor, CRID 210 can have access to real-time developer interactions with the source code editor. These real-time interactions with the source code editor can provide additional information regarding characteristics of a developer that can be included in a developer profile. As an example, where a developer frequently mistakenly keys “hte” [sic] when typing “the”, this idiosyncrasy can be tracked even where “hte” [sic] would be corrected before submission to a code repository in most circumstances. Further, an integrated CRID 210 can monitor environmental conditions for the source code editor, such as other programs running in the background, hardware information, time/day information, login/password features, source code editor software registration information, etc. As an example, where a source code editor serial number is registered to a trusted developer, that editor is running on a system with an identifiable CPU and known MAC address, the IP address for the system is in the city that the developer lives in, and the developer is logged into a known email account in the background, these characteristics can be considered to increase the likelihood that code submitted would be from the trusted developer. As a corollary, where code is submitted by a system having a source code editor that is not registered to the trusted developer, the real-time editing of the source code lacks idiosyncratic keying errors associated with the trusted developer, and the IP address is located in a foreign country not associated with the trusted developer, the submitted code can be considered more suspect and be subject to further validation processes before being committed to the code repository. Interfacing CRID 210 with a source code editor can allow for developer profiles based on characteristics present in the creation/editing process of code in addition to characteristics determined from analysis of a ‘final draft’ of the code submitted for inclusion in a code repository.

FIG. 3 illustrates a system 300 that facilitates code feature analysis training in accordance with aspects of the subject disclosure. System 300 can include CRC 302 and CRAC 304 communicatively coupled to training component 320. CRC 302 can comprise a memory, data store, or other data storage component, where code can be stored to facilitate developer access. CRAC 304 can facilitate access to CRC 302. In an aspect, training component 320 can be inserted into the communicative coupling between CRAC 304 and CRC 302 to facilitate aspects of the presently disclosed subject matter, such as training developer profiles.

Training component 320 can further be communicatively coupled to profile store component 312. Profile store component 312 can store a profile of a developer as disclosed herein. Training component 320 can develop a profile based on code sets associated with a developer. Training component 320 can store a profile on profile store component 312.

Training component 320 can access training data set 322. Training data set 322 can comprise source code information associated with a developer and can be used to develop a profile for the developer. In some embodiments, training data set 322 can comprise source code information for a plurality of developers and can facilitate determining profiles for a plurality of developers. Further, in some embodiments, training data set 322 can include a historical set of source code information such as an existing code repository and access log. In another embodiment, training data set 322 can comprise source code information for a plurality of projects for one or more developer profiles. As such, where a code repository already includes committed source code, and an intrusion detection system is being added to monitor future code commits and/or examine existing committed code files, the existing code tree, or parts thereof, can be employed as part of a training data set. As an example, where a company has a code repository with existing code from 100 developers, the existing code can be employed as training data set 322 to generate developer profiles for the 100 developers. These profiles can then be employed to analyze future code check-ins into the code repository.

Further, profiles generated from an existing code repository can be used to check the existing committed code stored in the code repository. Given that a malicious entity might have submitted code into the code repository under the guise of a trusted developer, there can be anomalous behavior for characteristics of the trusted developer in training data set 322 used to generate a profile for the trusted developer. As an example, where a trusted developer has only one code check-in on a Saturday in the entire training data set, the associated committed code can be treated as suspect due to the anomaly. Where that same committed code also has the word “color” spelled “colour” and this characteristic is not present in any other committed code from the trusted developer, the committed code can be treated as more suspect due to a second anomaly. Suspect committed code can be removed from training data set 322 before use in generating the trusted developer profile. This can be viewed as akin to disregarding a fastest and slowest lap time when computing an average lap time for a runner. The generated profile can then be run against the committed code and the suspect committed code file can be compared to the developer profile. Where the Saturday check-in and strange use of “colour” are significant in view of the other characteristic features of the trusted developer, the committed code file can be determined to be suspect and subjected to further review processes. As in the previous “runner example”, this can be likened to throwing out the slowest lap time of the runner before calculating the average lap time and then seeing that the average lap time is substantially faster than the slowest lap time, such that the slowest lap time would not easily be attributed to the same runner.
However, where the Saturday check-in and strange use of “colour” are, for example, the only two anomalies across perhaps several thousand characteristics of the trusted developer, the level of suspicion for that committed code may be minuscule and the file need not be subjected to further inspection. Returning to the “runner example”, this can be akin to throwing out the slowest lap time of the runner before calculating the average lap time and then seeing that the average lap time and the slowest lap time are reasonably close and could easily be attributed to the same runner.
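As a non-limiting illustration, the screening of a training data set described above can be sketched as follows. The sketch is an assumption made for explanation and is not the disclosed implementation: the `Commit` type, the feature names `weekend` and `color_spelling`, and the 0.5 threshold are all hypothetical. Each committed code file is scored by the fraction of its characteristics never observed in the developer's other committed code, so two anomalies out of two characteristics are treated as significant, while two anomalies out of several thousand characteristics would score near zero.

```python
from dataclasses import dataclass


@dataclass
class Commit:
    commit_id: str
    features: dict  # hypothetical: characteristic name -> observed value


def anomaly_fraction(commit, typical):
    """Fraction of a commit's characteristics never seen in the other commits."""
    checked = [name for name in commit.features if name in typical]
    if not checked:
        return 0.0
    odd = sum(1 for name in checked if commit.features[name] not in typical[name])
    return odd / len(checked)


def split_training_set(commits, threshold=0.5):
    """Separate commits kept for profile training from commits held as suspect."""
    kept, suspect = [], []
    for commit in commits:
        # Values observed in every *other* commit attributed to the developer.
        typical = {}
        for other in commits:
            if other is commit:
                continue
            for name, value in other.features.items():
                typical.setdefault(name, set()).add(value)
        if anomaly_fraction(commit, typical) >= threshold:
            suspect.append(commit)
        else:
            kept.append(commit)
    return kept, suspect


commits = [
    Commit("a", {"weekend": False, "color_spelling": "color"}),
    Commit("b", {"weekend": False, "color_spelling": "color"}),
    Commit("c", {"weekend": False, "color_spelling": "color"}),
    Commit("d", {"weekend": True, "color_spelling": "colour"}),  # Saturday check-in
]
kept, suspect = split_training_set(commits)
kept_ids = [c.commit_id for c in kept]
suspect_ids = [c.commit_id for c in suspect]
```

In this sketch, the commits in `kept` would be used to generate the profile, and the commits in `suspect` would afterwards be compared against that profile.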

In other embodiments, training data set 322 can include an existing training data set, such as a data set that has been previously manipulated to facilitate training one or more developer profiles. As such, for example, a developer can provide an employer with their existing developer profile, e.g., a developer profile generated at another employer, etc., or with an existing training data set, e.g., a training data set including code files known to be attributable to the developer, to rapidly generate a new profile for the developer. This can facilitate inclusion of new trusted developers into a code repository intrusion detection system in a rapid and efficient manner. Anomalies in the characteristics of the submitted developer profile and/or existing training data set can be addressed, for example, in a manner as disclosed for using an existing code tree as a training data set.

In some embodiments, training data set 322 can be compiled over time. Training data set 322 can be generated and updated as submitted code accrues. As an example, where training component 320 is included in a new code repository system, there can be no submitted code from which to generate training data set 322. As code is newly submitted by trusted developers into the young code repository, the committed code can be employed as training data set 322. As such, training data set 322 can change with each committed code file. These embodiments can be easily visualized for start-up companies employing new developers that access an essentially empty code tree on the code repository. As the new developers submit code, that code can be used to generate training data sets to generate developer profiles.

Training component 320 can further include profile engine 340 that can generate or update a profile based in part on training data set 322. Profile engine 340 can then facilitate storage of a generated profile on profile store component 312. In an aspect, profile engine 340 can analyze training data set 322 based on characteristic features associated with a developer to generate a developer profile. The developer profile can embody one or more characteristics, and in many cases large pluralities of characteristics, to form a pseudo-signature for a trusted developer. Many characteristic features mined from training data set 322 can be quite subtle, such as particular spellings of words, punctuation styles, word orders, repetitive use of select function calls or libraries, comment line usage and placement, time/day of committing code sets, location information, use of identified hardware for submissions, etc. Some of the many possible characteristics are disclosed herein with regard to FIG. 1. It will be appreciated that these and many more characteristics can be employed by profile engine 340 in generating or updating a profile.

Training component 320 can further include self-analysis component 350. Self-analysis component 350 can receive start-point code set 352, which can be stored on CRC 302. Start-point code set 352 can be a snapshot of developed code at any point in time chosen to represent a starting point. Self-analysis component 350 can receive training data set 322 and can extract modifications of the code tree stored on CRC 302 based on training data set 322. Self-analysis component 350 can then generate self-evaluation information 354 based on start-point code set 352 and extracted modifications. This can facilitate generating a historical state of the code tree stored on CRC 302 at any point associated with an extracted modification. In an aspect, this can allow training component 320 to reconstruct the sequential development of the code tree on CRC 302. Profiles can then be checked against the sequential development of the code tree to determine potential intrusions into the code repository.

FIG. 4 illustrates a system 400 that facilitates code feature analysis training and development of a training data set in accordance with aspects of the subject disclosure. System 400 can include CRC 402 and CRAC 404 communicatively coupled to training component 420. CRC 402 can comprise a memory, data store, or other data storage component, where code can be stored to facilitate developer access. CRAC 404 can facilitate access to CRC 402. In an aspect, training component 420 can be inserted into the communicative coupling between CRAC 404 and CRC 402 to facilitate aspects of the presently disclosed subject matter, such as training developer profiles. Training component 420 can further be communicatively coupled to profile store component 412. Profile store component 412 can store a profile of a developer as disclosed herein. Training component 420 can develop a profile based on code sets associated with a developer. Training component 420 can store a profile on profile store component 412.

Training component 420 can access training data set 422. Training data set 422 can comprise source code information associated with a developer and can be used to develop a profile for the developer. In some embodiments, training data set 422 can comprise source code information for a plurality of developers and can facilitate determining profiles for a plurality of developers. Further, in some embodiments, training data set 422 can include a historical set of source code information such as an existing code repository and access log. Training data set 422 can be stored local to training component 420, such as by storing training data set 422 on local store component 428.

Training component 420 can further comprise pull component 424. Pull component 424 can be communicatively coupled with CRC 402 to cause training component 420 to receive source code information from a code repository, such as a code repository stored on CRC 402. Source code information can include information about individual revisions to a source code file. As an example, pull component 424 can initiate the reception of source code information at training component 420, wherein the source code information can be an initial source code file and each revision of the source code file from a code tree stored in a code repository residing on, for example, CRC 402. In an embodiment, the source code files received by training component 420 can be stored on local store component 428.

Further, training component 420 can comprise difference component 425 that can facilitate training component 420 receiving information on code updates or code snippets committed by a developer for revisions of a source code file. As an example, difference component 425 can initiate the reception, by training component 420, of code snippets in a code tree. In an embodiment, the code snippets received by training component 420 can be stored on local store component 428.

Training component 420 can also include data component 426 that can facilitate training component 420 receiving an initial revision of a source code file of a code repository, for example, residing on CRC 402. Data component 426 can persist the initial revision of the source code file on training component 420, for example, by storing a copy of the initial revision on local store component 428. Training component 420 can, at least in part, reconstruct a source code archive from the initial revision received by way of data component 426 and the code snippets received by way of difference component 425.

In an aspect, the interaction of pull component 424, difference component 425, and data component 426 can provide valuable information about a source code set stored in a code repository at various stages of development. It can be likened to a mapping exercise in which the origination point, destination point, and travel legs can be analogous to an initial source code file, a current source code file, and code snippets related to changes to the source code file. In this example, a pull action can get at the origination and destination information. A difference action can get at the travel legs. A data action can reconstruct the journey by appropriately combining the origination information and the travel leg information. The reconstructed journey can then be compared to the destination information. In an aspect, the information caused to be received by training component 420 by way of pull component 424, difference component 425, and data component 426 can be embodied, at least in part, in training data set 422.
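As a non-limiting illustration of the mapping analogy above, the following sketch reconstructs the "journey" from an origination point and a sequence of travel legs and compares the result to the destination. The snippet format, a `(start, end, new_lines)` tuple that replaces a slice of lines, is a hypothetical simplification chosen for this sketch; the disclosure does not prescribe a format for code snippets.

```python
def apply_snippet(lines, snippet):
    """Apply one travel leg: replace lines[start:end] with new_lines."""
    start, end, new_lines = snippet
    return lines[:start] + list(new_lines) + lines[end:]


def reconstruct(initial, snippets):
    """Return every intermediate state, from the initial revision onward."""
    states = [list(initial)]
    for snippet in snippets:
        states.append(apply_snippet(states[-1], snippet))
    return states


# Origination point: the initial revision of a source code file.
initial = ["def area(r):", "    return 3.14 * r * r"]

# Travel legs: code snippets committed for later revisions.
snippets = [
    (0, 0, ["import math"]),                    # insert a line at the top
    (2, 3, ["    return math.pi * r * r"]),     # replace the return statement
]

# Destination point: the current source code file in the repository.
current = ["import math", "def area(r):", "    return math.pi * r * r"]

states = reconstruct(initial, snippets)
journey_matches_destination = states[-1] == current
```

A mismatch between the last reconstructed state and the current file would suggest, in this sketch, that the stored revision history does not fully account for the current code.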

System 400 can further comprise normalization component 427 that can normalize the information received at training component 420 by way of pull component 424, difference component 425, and data component 426. As an example, where code check-in times for a first developer are within minutes of each other and for a second developer are within hours of each other, a check-in time characteristic can be normalized by way of normalization component 427, to hours rather than minutes or days. As another example, a spelling error characteristic can be normalized, by normalization component 427, as a deviation from a mean number of spelling errors for all developers of code in a branch of a code tree. This exemplary spelling error characteristic can be different from a spelling error characteristic for the whole code tree, from a different branch of the tree, from a sub-branch of the tree, etc. As a further example, the size of an individual developer's code changes for each submitted code set can be normalized against the average size of code changes across multiple developers. Thus, in this example, where a first developer changes an average of 450 lines of code in 100 submitted code sets, and a second developer changes an average of 650 lines of code in 50 submitted code sets, then where the first developer submits 517 lines of code in a code set under examination, this can be treated as submitting an average sized code set, i.e., the weighted average of 450 lines for 100 sets and 650 lines for 50 sets is approximately 516.67 lines. In an aspect, information normalized by way of normalization component 427 can be embodied, at least in part, in training data set 422.
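The arithmetic of the code-change-size example can be checked with a short sketch. The function name `pooled_mean_lines` and the dictionary layout are assumptions made for illustration only; the figures are those of the example above.

```python
def pooled_mean_lines(history):
    """Weighted mean of changed lines per code set across developers.

    history maps a developer to (average lines per code set, number of code sets).
    """
    total_lines = sum(avg * count for avg, count in history.values())
    total_sets = sum(count for _, count in history.values())
    return total_lines / total_sets


history = {"first_developer": (450, 100), "second_developer": (650, 50)}

# (450 * 100 + 650 * 50) / (100 + 50) = 77500 / 150
mean_lines = pooled_mean_lines(history)

# A 517-line submission expressed relative to the pooled mean; a value
# near 1.0 corresponds to an average sized code set.
relative_size = 517 / mean_lines
```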

Training component 420 can further include profile engine 440 that can generate or update a developer profile based in part on training data set 422. Profile engine 440 can then facilitate storage of a generated profile on profile store component 412. In an aspect, profile engine 440 can analyze training data set 422 based on characteristic features associated with a developer to generate a developer profile. The developer profile can embody one or more characteristics, and in many cases large pluralities of characteristics, to form a pseudo-signature for a trusted developer. Many characteristic features mined from training data set 422 can be quite subtle, such as particular spellings of words, punctuation styles, word orders, repetitive use of select function calls or libraries, comment line usage and placement, time/day of committing code sets, location information, use of identified hardware for submissions, etc. Some of the many possible characteristics are disclosed herein with regard to FIG. 1. It will be appreciated that these and many more characteristics can be employed by profile engine 440 in generating or updating a profile.

FIG. 5 is a depiction of a system 500 that facilitates code feature analysis training with a trusted coder data set in accordance with aspects of the subject disclosure. System 500 can include CRC 502 and CRAC 504 communicatively coupled to training component 520. CRC 502 can comprise a memory, data store, or other data storage component, where code can be stored to facilitate developer access. CRAC 504 can facilitate access to CRC 502. In an aspect, training component 520 can be inserted into the communicative coupling between CRAC 504 and CRC 502 to facilitate aspects of the presently disclosed subject matter, such as training developer profiles. Training component 520 can further be communicatively coupled to profile store component 512. Profile store component 512 can store a profile of a developer as disclosed herein. Training component 520 can develop a profile based on code sets associated with a developer. Training component 520 can store a profile on profile store component 512.

Training component 520 can access training data set 522. Training data set 522 can comprise source code information associated with a developer and can be used to develop a profile for the developer. In some embodiments, training data set 522 can comprise source code information for a plurality of developers and can facilitate determining profiles for a plurality of developers. Further, in some embodiments, training data set 522 can include a historical set of source code information such as an existing code repository and access log. Training data set 522 can be stored local to training component 520, such as by storing training data set 522 on local store component 528.

Training component 520 can further comprise pull component 524. Pull component 524 can be communicatively coupled with CRC 502 to cause training component 520 to receive source code information from a code repository, such as a code repository stored on CRC 502. Source code information can include information about individual revisions to a source code file. Further, training component 520 can comprise difference component 525 that can facilitate training component 520 receiving information on code updates or code snippets committed by a developer for revisions of a source code file. Training component 520 can also include data component 526 that can facilitate training component 520 receiving an initial revision of a source code file of a code repository. Data component 526 can persist the initial revision of the source code file on training component 520, for example, by storing a copy of the initial revision on local store component 528. Training component 520 can, at least in part, reconstruct a source code archive from the initial revision received by way of data component 526 and the code snippets received by way of difference component 525. In an aspect, the interaction of pull component 524, difference component 525, and data component 526 can provide valuable information about a source code set stored in a code repository at various stages of development. In an embodiment, the source code files, the code snippets or updates, and the initial revisions of a source code file, received by training component 520 can be stored on local store component 528. In an aspect, the information caused to be received by training component 520 by way of pull component 524, difference component 525, and data component 526 can be embodied, at least in part, in training data set 522. System 500 can further comprise normalization component 527 that can normalize the information received at training component 520 by way of pull component 524, difference component 525, and data component 526.
In an aspect, information normalized by way of normalization component 527 can be embodied, at least in part, in training data set 522.

Training component 520 can also receive trusted coder data set 542. Trusted coder data set 542 can include information related to the characteristic features of code composed by one or more trusted developers. In an aspect, trusted coder data set 542 can be an existing developer profile or part of an existing developer profile. In another aspect, trusted coder data set 542 can be an existing set of source code files for a trusted developer that could be employed to generate a profile for the trusted developer. In a further aspect, trusted coder data set 542 can be an existing set of source code files for a trusted developer that contribute to generation of a profile for the trusted developer but might be insufficient to do so without further information on characteristics of the target developer. In an embodiment, trusted coder data set 542 can be received by way of profile store component 512. Trusted coder data set 542 can be stored locally on local store component 528. In an aspect, trusted coder data set 542 can be embodied, at least in part, in training data set 522.

Training component 520 can further include profile engine 540 that can generate or update a developer profile based in part on training data set 522. Profile engine 540 can then facilitate storage of a generated profile on profile store component 512. In an aspect, profile engine 540 can analyze training data set 522 based on characteristic features associated with a developer to generate a developer profile. The developer profile can embody one or more characteristics, and in many cases large pluralities of characteristics, to form a pseudo-signature for a trusted developer. Many characteristic features mined from training data set 522 can be quite subtle, such as particular spellings of words, punctuation styles, word orders, repetitive use of select function calls or libraries, comment line usage and placement, time/day of committing code sets, location information, use of identified hardware for submissions, etc. Some of the many possible characteristics are disclosed herein with regard to FIG. 1. It will be appreciated that these and many more characteristics can be employed by profile engine 540 in generating or updating a profile.

FIG. 6 is a depiction of a system 600 that facilitates code feature analysis with a weighted profile in accordance with aspects of the subject disclosure. System 600 can include CRC 602 and CRAC 604 communicatively coupled to detection component 630. CRC 602 can comprise a memory, data store, or other data storage component, where code can be stored to facilitate developer access. CRAC 604 can facilitate access to CRC 602, for example, submitted code set 606 can be directed to CRC 602 by way of CRAC 604. In an aspect, detection component 630 can be inserted into the communicative coupling between CRAC 604 and CRC 602 to facilitate aspects of the presently disclosed subject matter, such as employing developer profiles to detect intrusions into a code repository. As an example, submitted code set 606 can include a source code file submitted by a malicious entity posing as a trusted developer. Without an intrusion detection system, it is possible that the submitted source code file would be incorporated in the source code stored in the code repository and could end up in a build of the source code, where the intentions of the malicious entity could be realized. Detection component 630 can further be communicatively coupled to profile store component 612. Profile store component 612 can store a profile of a developer as disclosed herein. Detection component 630 can receive a profile associated with a developer.

Detection component 630 can include intrusion detection engine 650. Intrusion detection engine 650 can determine intrusion score information 614 based on an analysis of submitted code set 606 in view of one or more developer profiles. Developer profiles can be received by way of profile store component 612. Developer profiles can be adjusted by profile weighting component 634. Profile weighting component 634 can receive information on characteristic features to be included in an analysis of submitted code set 606 by way of intrusion detection engine 650. Profile weighting component 634 can weight developer profiles to adjust the impact of the characteristics represented therein.

As an example, where submitted code set 606 does not include location information related to the geographic location of the device used to submit the code files comprising submitted code set 606, intrusion detection engine 650 can indicate that location characteristic features of developer profiles will be ignored. In response to this indication, profile weighting component 634 can set the weight factor for location information to zero, causing any location information in employed developer profiles to have no effect.

As a second example, developers can have been instructed to include specific numerical sequences in their previously submitted code files. In this example, a weight factor can increase the effect of a developer profile having, or not having, these specific numerical sequences included in a relevant subset of source code employed in generating their developer profiles. As such, where these developers included the sequence in their historical code, this will be a generally positive indicator of their identity when the weighting is adapted to give relevance to the historic presence of the sequence in historical code. While the sequence may not be present in submitted code set 606, intrusion detection engine 650 can indicate that the developers of interest should have historically been including the sequence in their previously submitted code, e.g., the code employed in generating their profiles. As such, there should be an increased reliance on finding the sequence as a characteristic feature in confirming that the code base used to create the relevant developer profile is indeed based on code composed by the trusted developer. Where there is greater confidence that the trusted developer profile is not compromised, there can be greater confidence in the determined intrusion score information 614.

In an embodiment, weighting factors can be embodied in a separate data component, such as a matrix or vector, that can be created once to reflect the weighting determinations of intrusion detection engine 650, such that this separate data component can then be applied to one or more developer profiles, such as by simple matrix operations, vector multiplication, etc. This can avoid modification of the one or more developer profiles themselves. Further, the weighting factors can be compiled once for application to all included developer profiles. As examples, weighting factors can be determined for a single developer, for a single coding project, for a single business entity, for a joint project between business entities, for a set of code developers, for a group of projects, for an entire code repository, for all developers across one or more code repositories, or for nearly any other level of tuning, without departing from the scope of the presently disclosed subject matter.
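As a non-limiting illustration, a separate weighting component can be sketched as a mapping of characteristic names to weight factors that is applied element-wise to each developer profile, leaving the stored profiles unmodified. The profile representation used here, one numeric value per characteristic, together with the feature names and weight values, is an assumption made for illustration only.

```python
def apply_weights(profile, weights):
    """Return a weighted copy of a profile; the stored profile is not modified."""
    return {name: value * weights.get(name, 1.0) for name, value in profile.items()}


# Hypothetical weight factors, compiled once for all included profiles:
# location is ignored, temporal features are reduced, spelling is increased.
weights = {"comment_style": 1.0, "spelling": 2.0, "temporal": 0.5, "location": 0.0}

# Hypothetical stored profiles, one numeric value per characteristic feature.
profiles = {
    "dev_a": {"comment_style": 0.9, "spelling": 0.5, "temporal": 0.8, "location": 0.7},
    "dev_b": {"comment_style": 0.4, "spelling": 0.25, "temporal": 0.6, "location": 0.9},
}

weighted = {dev: apply_weights(profile, weights) for dev, profile in profiles.items()}
```

Because `apply_weights` builds a new dictionary, the same `weights` can be reused across profiles, projects, or repositories without altering the contents of the profile store.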

In an aspect, intrusion detection engine 650 can analyze submitted code set 606 to determine characteristic features. A subset of these characteristic features, including none, some, or all of the determined characteristic features, can be employed in an analysis against one or more developer profiles, including weighted developer profiles as disclosed herein. As an example, submitted code set 606 can include an indication that it was submitted by a trusted developer, an unusual style of designating a comment line, and a set of 14 words that are regularly spelled in the same incorrect manner in the submitted source code files, and can have time/day stamps corresponding to the prescribed submission policies of the owner of the code repository. Continuing the example, the profile of the trusted developer can be received from profile store component 612. Profile weighting component 634 can apply a zero weight factor to all characteristic features other than comment line features, spelling features, and temporal features of the developer's profile, in effect affording them no consideration. Moreover, the temporal features can be reduced in effect, and the spelling features increased in effect, by adjusting corresponding weighting factors by way of profile weighting component 634. Intrusion detection engine 650 can then compare the characteristic features of the submitted code set to the weighted characteristic features of the trusted developer profile. Where 13 of the 14 regularly misspelled words match common misspelling patterns of the trusted developer as embodied in the developer profile, the comment style is a close match to the developer profile, and the time/day stamps are within typical temporal windows indicated in the developer profile, it can be concluded that there is only a marginal possibility that the submitted code set was not authored by the indicated developer. As such, the submitted code can be accepted for committal to the code repository without further investigation.
The level of certainty that the author of submitted code set 606 is the indicated developer can be embodied in intrusion score information 614.

Adapting the preceding example, where submitted code set 606 pertains to highly sensitive source code, the marginal possibility that the submitting developer is not the trusted developer can be sufficient to cause further investigation of the submitted code. In effect, the level of confidence in authorship can be used as a trigger for further investigation. The sensitivity of the trigger can be adjusted by adapting a predetermined threshold trigger level. As an example, a threshold can be 65% confidence for non-critical code submissions, 85% for critical code submissions, and 99% for high security code submissions. Thus, a determined confidence of 88% would not trigger further investigation except for high security code submissions.
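The threshold trigger of the preceding example can be sketched as follows. The threshold values are those given above; the category names and the function `needs_investigation` are hypothetical labels introduced for this sketch.

```python
# Minimum confidence in authorship required to skip further investigation.
THRESHOLDS = {
    "non_critical": 0.65,
    "critical": 0.85,
    "high_security": 0.99,
}


def needs_investigation(confidence, sensitivity):
    """True where confidence in authorship falls below the trigger level."""
    return confidence < THRESHOLDS[sensitivity]


# A determined confidence of 88% checked against each category.
results = {name: needs_investigation(0.88, name) for name in THRESHOLDS}
```

Here only the `high_security` entry of `results` is true, which agrees with the example above.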

System 600 can further comprise code set destination control component (DEST) 638. DEST 638 can facilitate queuing of submitted code set 606 subsequent to determination of intrusion score information 614. In an aspect, DEST 638 can hold submitted code set 606 while an intrusion analysis is conducted, such that where there is sufficient confidence, the submitted code set 606 can then be committed to CRC 602. Where there is insufficient confidence, submitted code set 606 can be subjected to further scrutiny before committal to the code repository. In another aspect, submitted code set 606 can be committed to CRC 602 by DEST 638 while DEST 638 restricts access to the written location on CRC 602 until an intrusion analysis is satisfactorily completed, e.g., a quarantine of submitted code set 606 on CRC 602. It will be appreciated that numerous other techniques for tracking, monitoring, quarantining, or buffering submitted code set 606 can be implemented without departing from, and that all such techniques are considered within, the scope of the present subject matter.
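As a non-limiting illustration of the hold-then-commit aspect, the following sketch queues submitted code sets and routes each one either to committal or to further scrutiny once a confidence value is available. The class `DestinationControl`, its attributes, and the example confidence values are hypothetical; the quarantine-in-place aspect is not shown.

```python
from collections import deque


class DestinationControl:
    """Holds submitted code sets until an intrusion analysis result is known."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.pending = deque()   # code sets awaiting analysis
        self.committed = []      # code sets released to the repository
        self.review = []         # code sets subjected to further scrutiny

    def submit(self, code_set_id):
        self.pending.append(code_set_id)

    def resolve(self, confidence_for):
        """Route each pending code set using a caller-supplied confidence lookup."""
        while self.pending:
            code_set_id = self.pending.popleft()
            if confidence_for(code_set_id) >= self.threshold:
                self.committed.append(code_set_id)
            else:
                self.review.append(code_set_id)


dest = DestinationControl(threshold=0.85)
dest.submit("set_a")
dest.submit("set_b")

# Hypothetical confidence values standing in for intrusion score information.
confidences = {"set_a": 0.90, "set_b": 0.40}
dest.resolve(confidences.get)
```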

In view of the example system(s) described above, example method(s) that can be implemented in accordance with the disclosed subject matter can be better appreciated with reference to flowcharts in FIG. 7-FIG. 8. For purposes of simplicity of explanation, example methods disclosed herein are presented and described as a series of acts; however, it is to be understood and appreciated that the claimed subject matter is not limited by the order of acts, as some acts may occur in different orders and/or concurrently with other acts from that shown and described herein. For example, one or more example methods disclosed herein could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, interaction diagram(s) may represent methods in accordance with the disclosed subject matter when disparate entities enact disparate portions of the methods. Furthermore, not all illustrated acts may be required to implement a described example method in accordance with the subject specification. Further yet, two or more of the disclosed example methods can be implemented in combination with each other, to accomplish one or more aspects herein described. It should be further appreciated that the example methods disclosed throughout the subject specification are capable of being stored on an article of manufacture (e.g., a computer-readable medium) to allow transporting and transferring such methods to computers for execution, and thus implementation, by a processor or for storage in a memory.

FIG. 7 illustrates aspects of method 700 facilitating code feature analysis in accordance with aspects of the subject disclosure. At 710, a target code set can be received for analysis. The target code set can present at least one characteristic feature that can be employed in the analysis. Characteristic features can include those described elsewhere herein. The characteristic features can represent patterns associated with code composed by specific individual developers. These characteristics, when taken in combination, can be employed in associating other code with the individual developers. In an aspect, the characteristic features can act as a pseudo-signature. Characteristic features can include features of the code itself, such as spelling patterns and commenting styles, patterns associated with creation and submission of code, such as time/day and submission geographic location patterns, patterns associated with environmental factors associated with the code, such as MAC addresses and other software running in the background, etc. It will be appreciated that nearly any aspect surrounding the composition and submission of source code files can be employed as a characteristic feature in forming profiles, or analysis based on these profiles, for code repository intrusion detection as disclosed herein, and that all such aspects are considered within the scope of the present disclosure.

At 720, a profile can be received, such as a developer profile. The profile can be based, at least in part, on a historic code set, such as a set of source code previously composed by a developer. In some embodiments, the historic code set can be comprised of historic code for a plurality of developers including the entity associated with the profile. The profile can reflect determined historical characteristic features associated with a developer. As an example, where a developer nearly always includes a comment line that includes her name and a large prime number, and where the same developer only occasionally spells ‘color’ as ‘colour’, a profile of the developer can indicate that there is a high likelihood of finding a comment line with the developer's name and a large prime number and a moderate likelihood of finding the word ‘color’ spelled in either the American or the British style, and that these two characteristic features can be employed as factors in determining if a target code set is composed by the same developer. As such, in this example, target code that lacks the diverse forms of ‘color’ may, or may not, be attributable to the developer because there is only a moderate likelihood of finding the diverse forms of that particular word in code composed by that particular developer. However, where the same target code also lacks the developer's name and a large prime number in a comment line, it can be more likely that the developer is not the author of the code based on the historical propensity of the developer to include this particular comment line content in her code. Where a large plurality of characteristic features can be learned and leveraged for developer profiles, determining a developer's affiliation with a target code set can become accurate enough to serve as a useful tool. As such, where source code is submitted to a code repository, the source code can be checked against the purported author of the code, such that where the purported author may not be the actual author of the code, further actions can be taken before the submitted source code is relied on, for example, in a commercialized product, secure environment, etc.
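The comment-line and spelling example above can be sketched numerically. The following is a minimal illustration only, assuming each characteristic feature is modeled as an independent yes/no observation with a per-developer likelihood; the disclosure does not prescribe this model, and the names `log_evidence`, `signed_prime_comment`, and `mixed_colour_spelling`, as well as the likelihood values 0.95 and 0.5, are hypothetical.

```python
import math

def log_evidence(profile, observed):
    """Sum the log-likelihood of each observed feature under a profile.

    profile maps a feature name to the historical likelihood (strictly
    between 0 and 1) that the feature appears in the developer's code.
    observed maps a feature name to True or False for the target code set.
    The independence assumption is illustrative, not from the disclosure.
    """
    total = 0.0
    for feature, likelihood in profile.items():
        present = observed.get(feature, False)
        total += math.log(likelihood if present else 1.0 - likelihood)
    return total

# Hypothetical profile: the signed comment is nearly always present,
# the mixed spelling only moderately often.
profile = {"signed_prime_comment": 0.95, "mixed_colour_spelling": 0.5}

missing_spelling_only = {"signed_prime_comment": True, "mixed_colour_spelling": False}
missing_both = {"signed_prime_comment": False, "mixed_colour_spelling": False}

print(log_evidence(profile, missing_spelling_only))
print(log_evidence(profile, missing_both))
```

Under these assumed values, lacking only the moderately likely spelling feature costs little evidence, whereas also lacking the nearly always present comment line lowers the score substantially, which mirrors the reasoning in the example.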

At 730, an indicator value can be determined that is related to a comparison of a characteristic feature of the target code set with aspects of the profile. The indicator value can represent a level of confidence that the target code is composed by the developer associated with the profile. The determination can include weighting of the characteristic features to be analyzed. In an aspect, this weighting can be employed to reduce computation by giving a zero weight to characteristic features that are not present in the target code set. As an example, where the target code set lacks MAC address, developer name, use of special characters, and submission location characteristics, these characteristic features can be weighted to zero, such as by multiplication by a scalar zero value. In some embodiments, these weightings can be in the form of a vector or matrix and applied to a plurality of developer profiles to be compared to the target code set.
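The zero-weighting described at 730 can be sketched as follows. This is one possible reading, not the disclosed computation: the feature values are assumed to lie in the range 0 to 1, agreement is assumed to be one minus the absolute difference, and the names `mask_weights` and `indicator_value` and all numeric values are hypothetical.

```python
def mask_weights(weights, target):
    """Give zero weight to every feature absent from the target code set."""
    return {f: (w if f in target else 0.0) for f, w in weights.items()}

def indicator_value(target, profile, weights):
    """Weighted average agreement between target and profile features.

    Features with zero weight are skipped entirely, which is where the
    reduction in computation comes from.
    """
    numerator = 0.0
    denominator = 0.0
    for feature, weight in weights.items():
        if weight == 0.0:
            continue  # skipped: nothing is computed for this feature
        agreement = 1.0 - abs(target[feature] - profile[feature])
        numerator += weight * agreement
        denominator += weight
    return numerator / denominator if denominator else 0.0

# Hypothetical values; the target lacks MAC address and location data.
target = {"comment_style": 0.9, "spelling": 0.4}
profile = {"comment_style": 0.8, "spelling": 0.5,
           "mac_address": 1.0, "submission_location": 0.7}
base_weights = {"comment_style": 2.0, "spelling": 1.0,
                "mac_address": 3.0, "submission_location": 1.0}

weights = mask_weights(base_weights, target)
print(indicator_value(target, profile, weights))
```

The same masked weight mapping could be reused across a plurality of developer profiles, which corresponds loosely to applying a weighting vector to several profiles at once.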

At 740, method 700 can facilitate access to the indicator value. At this point, method 700 can end. Access to the indicator value can be associated with reporting processes, subsequent review processes, quarantining processes, profile update processes, etc. As an example, an indicator value can be accessed by a reporting component such that where the indicator value transitions a threshold value, a report is generated, and where the indicator value transitions a second threshold level, an alert is generated in addition to the report. This can facilitate a tiered reporting and response to potential code repository intrusions. As a second example, the indicator value can also be accessed by a review component that designates review of the source code set by another developer. As such, depending on the indicator value, different review policies can be observed, such as inter-group review for indicator values within a range of confidence in authorship values and extra-group review for indicator values outside of that range of confidence in authorship values.
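The tiered reporting example can be sketched as a pair of threshold checks. The sketch assumes the indicator value is a confidence in authorship between 0 and 1, so that lower values are more suspect; the function name `respond` and the threshold values 0.6 and 0.3 are hypothetical.

```python
def respond(confidence, report_below=0.6, alert_below=0.3):
    """Return the actions triggered by a confidence-in-authorship value.

    Crossing the first threshold generates a report; crossing the second
    generates an alert in addition to the report.
    """
    actions = []
    if confidence < report_below:
        actions.append("report")
    if confidence < alert_below:
        actions.append("alert")
    return actions

for value in (0.8, 0.5, 0.1):
    print(value, respond(value))
```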

In an aspect, method 700 can be employed to confirm authorship, refute authorship, or determine authorship from a group of potential authors. In an embodiment, a target code set can be analyzed against a profile of a purported developer such that a measurement of the likelihood that that developer either did, or did not, compose the target code set can be determined. In another embodiment, where a purported developer is not included with the target code set, the target code set can be run against a set of developer profiles. The indicator value determined can be associated with ranking the set of developer profiles based on the likelihood that they composed the target code set. As such, where a first developer of the set has more matches of significance for characteristic features than a second developer for the analysis of the target code set, then the first developer can be ranked as more likely to have composed the target code set than the second developer. Further, the likelihood that the first developer composed the target code set can be evaluated directly, e.g., while the first developer is more likely than the second developer to have composed the target code set, it is still unlikely that the first developer did indeed compose the target code set based on fewer substantial correlations to the characteristics of the target code set than a predetermined threshold level. In practice, this can allow a code repository to check submitted code files against a library of authorized developers to determine which files are likely composed by which developers and which code files are likely not composed by any authorized developer. Therefore, even where a first authorized developer submits a code set including a code file from a second authorized developer, this dual authorship can be noted without suspicion that an unauthorized developer has actually intruded into the code repository.
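The distinction between relative ranking and the absolute threshold check can be sketched as follows. The names `rank_candidates`, `dev_a`, and `dev_b`, and the score and threshold values, are hypothetical; the sketch assumes one indicator value per developer profile has already been computed.

```python
def rank_candidates(scores, min_confidence):
    """Rank developers by indicator value and apply an absolute threshold.

    Returns the ranked list and the likely author, or None when even the
    best-ranked developer falls below the predetermined threshold.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0] if ranked else (None, 0.0)
    likely_author = best_name if best_score >= min_confidence else None
    return ranked, likely_author

# dev_a outranks dev_b, yet is still an unlikely author at threshold 0.5.
ranked, author = rank_candidates({"dev_a": 0.35, "dev_b": 0.2}, min_confidence=0.5)
print(ranked, author)
```

A result of None for the likely author corresponds to a code file that is likely not composed by any authorized developer and may warrant further scrutiny.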

FIG. 8 illustrates aspects of method 800 facilitating code feature analysis profile development in accordance with aspects of the subject disclosure. At 810, method 800 can include receiving a training code set associated with a developer. The training code set can include source code files wherein the source code presents at least one characteristic feature as disclosed elsewhere herein.

At 820, a developer profile can be determined based on the training code set. In an aspect, determining the developer profile can include generating a new developer profile. In another aspect, determining the developer profile can include updating an existing developer profile. Determining the developer profile can include determining one or more characteristic features present in source code included in the training code set. Further, the determination of the developer profile can include generating a cost path across the one or more characteristic features.
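Determining which characteristic features are present in a training code set can be sketched as below. This covers only feature detection and frequency of occurrence; it does not attempt the cost path, whose construction is not detailed here. The feature names, the regular expressions, and the function `build_profile` are hypothetical.

```python
import re

# Hypothetical feature detectors; a real profile would use many more.
FEATURE_PATTERNS = {
    "british_colour": re.compile(r"\bcolour\b"),
    "signed_comment": re.compile(r"#\s*author:\s*\w+", re.IGNORECASE),
}

def build_profile(training_files):
    """Return, per feature, the fraction of training files exhibiting it."""
    counts = {name: 0 for name in FEATURE_PATTERNS}
    for source in training_files:
        for name, pattern in FEATURE_PATTERNS.items():
            if pattern.search(source):
                counts[name] += 1
    n = len(training_files)
    return {name: (count / n if n else 0.0) for name, count in counts.items()}

training_files = [
    "# author: ada\ncolour = 1",
    "# Author: ada\ncolor = 2",
    "# author: ada\nx = 3",
    "# author: ada\ncolour = 4",
]
print(build_profile(training_files))
```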

At 830, weighting of the cost path can occur to adjust the significance of portions of the cost path across the characteristic features embodied in a developer profile. Weighting can increase or decrease the effect resulting from encountering a particular characteristic feature, or set of characteristic features, in conjunction with an intrusion analysis employing a developer profile. As an example, highly unique characteristic features, such as special character sequences hidden in source code by a developer as a way of signing the source code, can be given greater weight than frequent but common characteristic features, such as common alternative spelling of words, for example, ‘metre’ in place of ‘meter’. As such, the inclusion of a highly unusual sequence of characters in many source code files associated with a particular developer can be strongly weighted in that developer's profile, such that when future code is checked against that developer's profile, the presence or lack of that unusual sequence of characters, as a result of the weighting, can more strongly impact the computed validation score for that developer as the composer of the future code than it otherwise would.
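One way to give unusual features more weight than common ones is an inverse-frequency weight, borrowed here from information retrieval rather than taken from the disclosure. In this sketch the function name `uniqueness_weight` and the developer counts are hypothetical.

```python
import math

def uniqueness_weight(total_developers, developers_with_feature):
    """Inverse-frequency weight: rarer features receive larger weights.

    A feature shared by every developer gets weight 0; a feature seen in
    only one developer's code gets the maximum weight log(total).
    """
    n = max(developers_with_feature, 1)  # guard against division by zero
    return math.log(total_developers / n)

# Hypothetical catalog of 200 developers.
hidden_signature = uniqueness_weight(200, 1)     # used by a single developer
alternate_spelling = uniqueness_weight(200, 80)  # shared by many developers
print(hidden_signature, alternate_spelling)
```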

In some embodiments, normalization can be conducted to reduce redundancy of data in the resulting developer profile. This can reduce larger collections of data in the profile into smaller sets of interrelated data to facilitate the isolation of data such that additions, deletions, or modifications of data can propagate through a profile by way of the defined relationships. Further, normalization can include adapting developer profiles to allow them to be analyzed by comparable metrics. As an example, a characteristic feature for a first developer can be determined to be “60” while the same feature on a second developer profile is determined to be “580” even though the significance of the characteristic features of the two profiles is equivalent. As such, the second developer can be normalized to the first developer to allow results for analysis against the two profiles to be comparable, such as scaling the second profile down by a factor of 10 so that the score is normalized to “58”.
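The scaling in the example above can be sketched as a conversion to a common full-scale value. The sketch assumes that each profile records the full scale on which its raw values were measured (100 for the first profile and 1000 for the second, both hypothetical); the function name `normalize` is likewise hypothetical.

```python
def normalize(raw_value, profile_full_scale, common_full_scale=100.0):
    """Rescale a raw feature value onto a scale shared by all profiles."""
    return raw_value * common_full_scale / profile_full_scale

first = normalize(60.0, profile_full_scale=100.0)     # already on the common scale
second = normalize(580.0, profile_full_scale=1000.0)  # scaled down by a factor of 10
print(first, second)
```

After rescaling, the two values (60 and 58) can be compared directly, which matches the stated goal of analyzing profiles by comparable metrics.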

Updating an existing developer profile can also include determining one or more characteristic features present in source code included in the training code set and generating a cost path across the one or more characteristic features. However, updating the existing developer profile includes adapting the existing cost path and characteristic features associated with the developer in view of any new characteristic features gleaned from the training code set or cost path determined for those characteristic features. Moreover, weighting can include consideration of weighting present in the existing developer profile. Similarly, normalization can be applied during updating of a developer profile to facilitate lower data redundancy and comparable metrics between developer profiles.
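Folding new training data into an existing profile can be sketched as a count-weighted average of per-feature occurrence rates. This is an assumed update rule for illustration, covering only feature frequencies and not the cost path; the function name `update_profile` and all feature names and values are hypothetical.

```python
def update_profile(existing, existing_count, new, new_count):
    """Merge per-feature occurrence rates, weighted by file counts.

    A feature missing from one side is treated as never observed there,
    so newly gleaned features enter the profile at a reduced rate.
    """
    total = existing_count + new_count
    if total == 0:
        raise ValueError("at least one code file is required")
    merged = {}
    for feature in set(existing) | set(new):
        old_rate = existing.get(feature, 0.0)
        new_rate = new.get(feature, 0.0)
        merged[feature] = (old_rate * existing_count + new_rate * new_count) / total
    return merged, total

existing = {"signed_comment": 1.0, "british_spelling": 0.5}
fresh = {"british_spelling": 1.0, "tab_indent": 0.4}
merged, file_count = update_profile(existing, 10, fresh, 10)
print(merged, file_count)
```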

At 840, access to the developer profile can be facilitated. At this point, method 800 can end. Access to the developer profile can be short or long term. As an example of short-term access, a developer profile can be determined from a set of source code known to be attributable to the developer and promptly employed in validating a target source code set against the developer profile. As an example of long-term access, a developer can have a profile generated, such as by an independent validating organization, and retain their developer profile on a storage medium, such as a file on a portable drive, thumbdrive, CD, DVD, as a file in a cloud storage medium, etc., such that the developer profile can be supplied to interested parties, such as contracting employers, to validate code submitted by the developer hired by the contracting employer. As another example of long-term access, developer profiles can be stored by an entity associated with a code repository, such as a corporation that operates a code repository, allowing the entity to catalog developers over time. As such, code can be checked against a subset of the full set of cataloged developer profiles. This can be helpful, for example, where a developer leaves a company and later uses their knowledge of the former employer's code repository and procedures to attempt to infiltrate the code repository. Such an attempt can be detected by identifying code submitted by the developer through comparison to their developer profile cataloged by the former employer. As such, this type of identified source code can be reported to the employer security team and the code can be subject to further scrutiny.

FIG. 9 is a schematic block diagram of a sample-computing environment 900 with which the claimed subject matter can interact. The system 900 includes one or more remote component(s) 910, which can include client-side component(s). The remote component(s) 910 can be hardware and/or software (e.g., threads, processes, computing devices). In some embodiments, remote component(s) 910 can include CRAC 104-604. As an example, remote component(s) 910 can be a developer's home computer (remote from a target server) that has an interface, e.g., CRAC 104, etc., for submitting code to a code repository on a target server. In some embodiments, CRID 110-610 can be included in remote component(s) 910.

The system 900 also includes one or more local component(s) 920, which can include server-side component(s). The local component(s) 920 can be hardware and/or software (e.g., threads, processes, computing devices). In some embodiments, local component(s) 920 can include CRC 102-602. As an example, local component(s) 920 can be a target server housing a code repository that can receive code submissions from remote component(s) 910, by way of an interface, e.g., CRAC 104, etc. In some embodiments, CRID 110-610 can be included in local component(s) 920. The local component(s) 920 can house threads to perform transformations by employing the subject innovation, for example.

One possible communication between a remote component(s) 910 and a local component(s) 920 can be in the form of a data packet adapted to be transmitted between two or more computer processes. As an example, a code set can be communicated between a code developer's computing system, e.g., remote component 910, and a code repository, e.g., a local component 920. The system 900 includes a communication framework 940 that can be employed to facilitate communications between the remote component(s) 910 and the local component(s) 920. The remote component(s) 910 are operably connected to one or more remote data store(s) 950 that can be employed to store information on the remote component(s) 910 side of communication framework 940. Similarly, the local component(s) 920 are operably connected to one or more local data store(s) 930 that can be employed to store information on the local component(s) 920 side of communication framework 940.

In order to provide a context for the various aspects of the disclosed subject matter, FIG. 10, and the following discussion, are intended to provide a brief, general description of a suitable environment in which the various aspects of the disclosed subject matter can be implemented. While the subject matter has been described above in the general context of computer-executable instructions of a computer program that runs on a computer and/or computers, those skilled in the art will recognize that the disclosed subject matter also can be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, etc. that perform particular tasks and/or implement particular abstract data types.

In the subject specification, terms such as “store,” “storage,” “data store,” “data storage,” “database,” and substantially any other information storage component relevant to operation and functionality of a component, refer to “memory components,” or entities embodied in a “memory” or components comprising the memory. It will be appreciated that the memory components described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory, including, by way of illustration and not limitation, volatile memory 1020 (see below), non-volatile memory 1022 (see below), disk storage 1024 (see below), and memory storage 1046 (see below). Further, nonvolatile memory can be included in read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and direct Rambus RAM (DRRAM). Additionally, the disclosed memory components of systems or methods herein are intended to comprise, without being limited to comprising, these and any other suitable types of memory.

Moreover, it will be noted that the disclosed subject matter can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, mini-computing devices, mainframe computers, as well as personal computers, hand-held computing devices (e.g., PDA, phone, watch, tablet computers, netbook computers, . . . ), microprocessor-based or programmable consumer or industrial electronics, and the like. The illustrated aspects can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network; however, some if not all aspects of the subject disclosure can be practiced on stand-alone computers. In a distributed computing environment, program modules can be located in both local and remote memory storage devices.

FIG. 10 illustrates a block diagram of a computing system 1000 operable to execute the disclosed systems and methods in accordance with an embodiment. Computer 1012, which can be employed, for example, by a developer to submit code to a code repository, e.g., CRC 102-602, includes a processing unit 1014, a system memory 1016, and a system bus 1018. Computer 1012 can also comprise, for example, CRID 110, 210, training component 320-520, or detection component 630. System bus 1018 couples system components including, but not limited to, system memory 1016 to processing unit 1014. Processing unit 1014 can be any of various available processors. Dual microprocessors and other multiprocessor architectures also can be employed as processing unit 1014.

System bus 1018 can be any of several types of bus structure(s) including a memory bus or a memory controller, a peripheral bus or an external bus, and/or a local bus using any variety of available bus architectures including, but not limited to, Industry Standard Architecture (ISA), Micro-Channel Architecture (MCA), Extended ISA (EISA), Intelligent Drive Electronics (IDE), VESA Local Bus (VLB), Peripheral Component Interconnect (PCI), Card Bus, Universal Serial Bus (USB), Accelerated Graphics Port (AGP), Personal Computer Memory Card International Association bus (PCMCIA), Firewire (IEEE 1394), and Small Computer Systems Interface (SCSI).

System memory 1016 can include volatile memory 1020 and nonvolatile memory 1022. A basic input/output system (BIOS), containing routines to transfer information between elements within computer 1012, such as during start-up, can be stored in nonvolatile memory 1022. By way of illustration, and not limitation, nonvolatile memory 1022 can include ROM, PROM, EPROM, EEPROM, or flash memory. Volatile memory 1020 includes RAM, which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms such as SRAM, dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Computer 1012 can also include removable/non-removable, volatile/non-volatile computer storage media. FIG. 10 illustrates, for example, disk storage 1024. Disk storage 1024 includes, but is not limited to, devices like a magnetic disk drive, floppy disk drive, tape drive, flash memory card, or memory stick. In addition, disk storage 1024 can include storage media separately or in combination with other storage media including, but not limited to, an optical disk drive such as a compact disk ROM device (CD-ROM), CD recordable drive (CD-R Drive), CD rewritable drive (CD-RW Drive) or a digital versatile disk ROM drive (DVD-ROM). To facilitate connection of the disk storage devices 1024 to system bus 1018, a removable or non-removable interface is typically used, such as interface 1026.

Computing devices typically include a variety of media, which can include computer-readable storage media or communications media; these two terms are used herein differently from one another, as follows.

Computer-readable storage media can be any available storage media that can be accessed by the computer and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer-readable storage media can be implemented in connection with any method or technology for storage of information such as computer-readable instructions, program modules, structured data, or unstructured data. Computer-readable storage media can include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other tangible media which can be used to store desired information. In this regard, the term “tangible” herein as may be applied to storage, memory or computer-readable media, is to be understood to exclude only propagating intangible signals per se as a modifier and does not relinquish coverage of all standard storage, memory or computer-readable media that are not only propagating intangible signals per se. In an aspect, tangible media can include non-transitory media wherein the term “non-transitory” herein as may be applied to storage, memory or computer-readable media, is to be understood to exclude only propagating transitory signals per se as a modifier and does not relinquish coverage of all standard storage, memory or computer-readable media that are not only propagating transitory signals per se. Computer-readable storage media can be accessed by one or more local or remote computing devices, e.g., via access requests, queries or other data retrieval protocols, for a variety of operations with respect to the information stored by the medium.

Communications media typically embody computer-readable instructions, data structures, program modules or other structured or unstructured data in a data signal such as a modulated data signal, e.g., a carrier wave or other transport mechanism, and includes any information delivery or transport media. The term “modulated data signal” or signals refers to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in one or more signals. By way of example, and not limitation, communication media include wired media, such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.

It can be noted that FIG. 10 describes software that acts as an intermediary between users and computer resources described in suitable operating environment 1000. Such software includes an operating system 1028. Operating system 1028, which can be stored on disk storage 1024, acts to control and allocate resources of computer system 1012. System applications 1030 take advantage of the management of resources by operating system 1028 through program modules 1032 and program data 1034 stored either in system memory 1016 or on disk storage 1024. It is to be noted that the disclosed subject matter can be implemented with various operating systems or combinations of operating systems.

A user can enter commands or information into computer 1012 through input device(s) 1036. As an example, a developer can submit source code to a code repository, such as through CRID 110, 210, etc., by way of a user interface embodied in a touch sensitive display panel allowing a developer to interact with computer 1012. Input devices 1036 include, but are not limited to, a pointing device such as a mouse, trackball, stylus, touch pad, keyboard, microphone, joystick, game pad, satellite dish, scanner, TV tuner card, digital camera, digital video camera, web camera, cell phone, smartphone, tablet computer, etc. These and other input devices connect to processing unit 1014 through system bus 1018 by way of interface port(s) 1038. Interface port(s) 1038 include, for example, a serial port, a parallel port, a game port, a universal serial bus (USB), an infrared port, a Bluetooth port, an IP port, or a logical port associated with a wireless service, etc. Output device(s) 1040 use some of the same type of ports as input device(s) 1036.

Thus, for example, a USB port can be used to provide input to computer 1012 and to output information from computer 1012 to an output device 1040. Output adapter 1042 is provided to illustrate that there are some output devices 1040 like monitors, speakers, and printers, among other output devices 1040, which use special adapters. Output adapters 1042 include, by way of illustration and not limitation, video and sound cards that provide means of connection between output device 1040 and system bus 1018. It should be noted that other devices and/or systems of devices provide both input and output capabilities such as remote computer(s) 1044.

Computer 1012 can operate in a networked environment using logical connections to one or more remote computers, such as remote computer(s) 1044. Remote computer(s) 1044 can be a personal computer, a server, a router, a network PC, cloud storage, cloud service, a workstation, a microprocessor based appliance, a peer device, or other common network node and the like, and typically includes many or all of the elements described relative to computer 1012.

For purposes of brevity, only a memory storage device 1046 is illustrated with remote computer(s) 1044. Remote computer(s) 1044 is logically connected to computer 1012 through a network interface 1048 and then physically connected by way of communication connection 1050. Network interface 1048 encompasses wire and/or wireless communication networks such as local-area networks (LAN) and wide-area networks (WAN). LAN technologies include Fiber Distributed Data Interface (FDDI), Copper Distributed Data Interface (CDDI), Ethernet, Token Ring and the like. WAN technologies include, but are not limited to, point-to-point links, circuit-switching networks like Integrated Services Digital Networks (ISDN) and variations thereon, packet switching networks, and Digital Subscriber Lines (DSL). As noted below, wireless technologies may be used in addition to or in place of the foregoing.

Communication connection(s) 1050 refer(s) to hardware/software employed to connect network interface 1048 to bus 1018. While communication connection 1050 is shown for illustrative clarity inside computer 1012, it can also be external to computer 1012. The hardware/software for connection to network interface 1048 can include, for example, internal and external technologies such as modems, including regular telephone grade modems, cable modems and DSL modems, ISDN adapters, and Ethernet cards.

The above description of illustrated embodiments of the subject disclosure, including what is described in the Abstract, is not intended to be exhaustive or to limit the disclosed embodiments to the precise forms disclosed. While specific embodiments and examples are described herein for illustrative purposes, various modifications are possible that are considered within the scope of such embodiments and examples, as those skilled in the relevant art can recognize.

In this regard, while the disclosed subject matter has been described in connection with various embodiments and corresponding Figures, where applicable, it is to be understood that other similar embodiments can be used or modifications and additions can be made to the described embodiments for performing the same, similar, alternative, or substitute function of the disclosed subject matter without deviating therefrom. Therefore, the disclosed subject matter should not be limited to any single embodiment described herein, but rather should be construed in breadth and scope in accordance with the appended claims below.

As it is employed in the subject specification, the term “processor” can refer to substantially any computing processing unit or device comprising, but not limited to comprising, single-core processors; single-processors with software multithread execution capability; multi-core processors; multi-core processors with software multithread execution capability; multi-core processors with hardware multithread technology; parallel platforms; and parallel platforms with distributed shared memory. Additionally, a processor can refer to an integrated circuit, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a field programmable gate array (FPGA), a programmable logic controller (PLC), a complex programmable logic device (CPLD), a discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. Processors can exploit nano-scale architectures such as, but not limited to, molecular and quantum-dot based transistors, switches and gates, in order to optimize space usage or enhance performance of user equipment. A processor may also be implemented as a combination of computing processing units.

In the subject specification, terms such as “store,” “storage,” “data store,” “data storage,” “database,” and substantially any other information storage component relevant to operation and functionality of a component, refer to “memory components,” or entities embodied in a “memory” or components comprising the memory. It will be appreciated that the memory components described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory.

As used in this application, the terms “component,” “system,” “platform,” “layer,” “selector,” “interface,” and the like are intended to refer to a computer-related entity or an entity related to an operational apparatus with one or more specific functionalities, wherein the entity can be either hardware, a combination of hardware and software, software, or software in execution. As an example, a component may be, but is not limited to being, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration and not limitation, both an application running on a server and the server can be a component. One or more components may reside within a process and/or thread of execution and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate via local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from one component interacting with another component in a local system, distributed system, and/or across a network such as the Internet with other systems via the signal). As another example, a component can be an apparatus with specific functionality provided by mechanical parts operated by electric or electronic circuitry, which is operated by a software or firmware application executed by a processor, wherein the processor can be internal or external to the apparatus and executes at least a part of the software or firmware application. As yet another example, a component can be an apparatus that provides specific functionality through electronic components without mechanical parts, the electronic components can include a processor therein to execute software or firmware that confers at least in part the functionality of the electronic components.

In addition, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or.” That is, unless specified otherwise, or clear from context, “X employs A or B” is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then “X employs A or B” is satisfied under any of the foregoing instances. Moreover, articles “a” and “an” as used in the subject specification and annexed drawings should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form.

Furthermore, the terms “user,” “subscriber,” “customer,” “consumer,” “prosumer,” “agent,” and the like are employed interchangeably throughout the subject specification, unless context warrants particular distinction(s) among the terms. It should be appreciated that such terms can refer to human entities or automated components (e.g., supported through artificial intelligence, as through a capacity to make inferences based on complex mathematical formalisms), that can provide simulated vision, sound recognition and so forth.

What has been described above includes examples of systems and methods illustrative of the disclosed subject matter. It is, of course, not possible to describe every combination of components or methods herein. One of ordinary skill in the art may recognize that many further combinations and permutations of the claimed subject matter are possible. Furthermore, to the extent that the terms “includes,” “has,” “possesses,” and the like are used in the detailed description, claims, appendices and drawings such terms are intended to be inclusive in a manner similar to the term “comprising” as “comprising” is interpreted when employed as a transitional word in a claim.

What is claimed is:
 1. A system, comprising: a memory that stores computer-executable instructions; and a processor, communicatively coupled to the memory, that facilitates execution of the computer-executable instructions to at least: receive a training code file set, wherein one or more code files of the training code file set comprise source code; ascribe a characteristic feature to an entity in response to identifying the characteristic feature from the training code file set, a code file of the training code file set, or a computer instruction of a code file of the training code file set; associate a frequency of occurrence value, and a uniqueness value, with an entity profile related to the entity in response to determining the frequency of occurrence value and the uniqueness value for the characteristic feature; determine a feature value related to the characteristic feature; and facilitate access to the feature value.
 2. The system of claim 1, wherein the training code file set is related to an analysis of a target code file set.
 3. The system of claim 1, wherein the training code file set includes a plurality of historical code files associated with the entity and the characteristic feature is present in two or more of the plurality of historical code files.
 4. The system of claim 1, wherein the training code file set includes a plurality of historical code files associated with a plurality of entities, including the entity, and the characteristic feature is present in less than all of the plurality of historical code files.
 5. The system of claim 4, wherein the characteristic feature is present in code files of the plurality of historical code files associated with less than all of the plurality of entities.
 6. The system of claim 1, wherein the feature value is based, at least in part, on the frequency of occurrence value and the uniqueness value.
 7. The system of claim 1, wherein a cost path across a plurality of identified characteristics including the characteristic feature is determined, based in part on the feature value, the frequency of occurrence value, and the uniqueness value, wherein the cost path is further associated with the entity profile.
 8. The system of claim 7, wherein one or more weighting factors are applied to the cost path based on the frequency of occurrence value or the uniqueness value, to modify the effect of the characteristic feature in a future analysis employing the cost path.
 9. The system of claim 1, wherein the characteristic feature is analyzed against the entity profile based on the feature value.
 10. The system of claim 9, wherein the analysis of the characteristic feature against the entity profile determines an identity value associated with a level of confidence that a target code file set is, at least in part, attributable to the entity associated with the entity profile.
 11. The system of claim 9, wherein the characteristic feature is analyzed against a set of entity profiles, including the entity profile, based on the feature value, wherein the analysis determines an identity value associated with a level of confidence that a target code file set is, at least in part, attributable to one or more entities of the set of entity profiles.
 12. The system of claim 9, wherein a target code file set, or a portion thereof, is subjected to further scrutiny based on a determined identity value satisfying a predetermined condition, wherein the determined identity value is associated with a level of confidence that the target code file set is attributable to the entity.
 13. A method, comprising: receiving, by a system including a processor, a code file set including one or more code files comprising source code, wherein the code file set comprises a training code file set; in response to analyzing, by the system, the one or more code files comprising the code file set, identifying, by the system, a characteristic feature present in at least a portion of the code file set; determining, by the system, a frequency of occurrence value for the characteristic feature; determining, by the system, a uniqueness value for the characteristic feature; associating, by the system, the frequency of occurrence value and the uniqueness value with an entity profile associated with an identity; determining, by the system, a feature score for the characteristic feature; and facilitating, by the system, access to the feature score.
 14. The method of claim 13, wherein the identifying the characteristic feature includes identifying the characteristic feature as part of training the entity profile associated with the identity.
 15. The method of claim 14, wherein the receiving the training code file set includes receiving a plurality of historical code files associated with a plurality of identities, and the identifying a characteristic feature includes identifying the characteristic feature as part of training a plurality of profiles comprising the entity profile.
 16. The method of claim 13, wherein identifying the characteristic feature includes determining one or more weighting factors to be applied to a cost path based on the uniqueness value.
 17. The method of claim 16, wherein identifying the characteristic feature includes determining one or more weighting factors to be applied to a cost path based on the frequency of occurrence value.
 18. The method of claim 13, further comprising: determining a cost path across a plurality of identified characteristics, including the characteristic feature, based in part on the feature score, the frequency of occurrence value, and the uniqueness value; and associating the cost path with the entity profile.
 19. The method of claim 13, wherein the receiving a code file set includes receiving a target code file set and the identifying the characteristic feature includes analyzing a target code set against the entity profile associated with the identity.
 20. The method of claim 19, wherein the analyzing the target code set includes determining an identity value associated with a level of confidence that the target code file set is, at least in part, composed by an entity associated with the identity.
 21. The method of claim 19, wherein the analyzing the target code set includes analyzing the target code set against a set of profiles associated with a plurality of identities, including the identity and determining an identity value associated with a level of confidence that the target code file set is, at least in part, composed by an entity associated with an identity of the plurality of identities associated with the set of profiles.
 22. The method of claim 19, wherein the analyzing the target code set includes subjecting at least a portion of the target code set to further scrutiny based on determining an identity value that satisfies a predetermined condition, the identity value being associated with a level of confidence that the target code file set is attributable to an entity associated with the identity.
 23. A computer-readable storage device comprising computer-executable instructions that, in response to execution, cause a device including a processor to perform operations, comprising: receiving a training code file set, wherein one or more code file of the training code file set comprises source code; associating a characteristic feature with an entity in response to identifying the characteristic feature from the training code file set, a code file of the training code file set, or a computer instruction of a code file of the training code file set; associating a frequency of occurrence value, for the characteristic feature, with an entity profile related to the entity; associating a uniqueness value, for the characteristic feature, with the entity profile; determining a feature value related to the characteristic feature based on the entity profile; and determining an intrusion score based on a determined level of confidence that a target code set is authored by a developer based on the feature value.
 24. The computer-readable storage device of claim 23, wherein the operations further comprise: receiving a target code file set, wherein one or more target code files of the code file set comprise source code; receiving a developer profile comprising a set of characteristic feature values associated with a presence of characteristic features correlated to historical code files associated with the developer profile; and determining an update to the developer profile based on analysis of characteristic features associated with the training code set.
 25. The computer-readable storage device of claim 23, wherein the operations further comprise: facilitating further scrutiny of the target code set based on the intrusion score satisfying a predetermined condition associated with the developer profile.
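
By way of non-limiting illustration only, the quantities recited above (a frequency of occurrence value, a uniqueness value, a feature value, and an identity value) can be sketched in Python as follows. The specific formulas are assumptions chosen for brevity and are not prescribed by the disclosure: the frequency is taken as the fraction of an entity's training files exhibiting a feature, the uniqueness as the reciprocal of the number of entities exhibiting it, and the feature value as their product. The names `build_profiles`, `identity_value`, and `needs_scrutiny`, the feature labels, and the threshold of 0.5 are hypothetical. Feature extraction itself is outside the scope of the sketch, so each code file is represented simply as a list of feature labels.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Assumed input shape: entity name -> list of code files,
# where each code file is a list of already-extracted feature labels.
TrainingSet = Dict[str, List[List[str]]]
Profile = Dict[str, Dict[str, float]]


def build_profiles(training: TrainingSet) -> Dict[str, Profile]:
    """Build one profile per entity from a training code file set."""
    # Count how many entities exhibit each feature at least once.
    entities_with = Counter()
    for files in training.values():
        seen = set()
        for code_file in files:
            seen.update(code_file)
        entities_with.update(seen)

    profiles: Dict[str, Profile] = {}
    for entity, files in training.items():
        # Count how many of this entity's files exhibit each feature.
        files_with = Counter()
        for code_file in files:
            files_with.update(set(code_file))
        profile: Profile = {}
        for feature, count in files_with.items():
            frequency = count / len(files)            # assumed definition
            uniqueness = 1.0 / entities_with[feature]  # assumed definition
            profile[feature] = {
                "frequency": frequency,
                "uniqueness": uniqueness,
                "value": frequency * uniqueness,       # assumed feature value
            }
        profiles[entity] = profile
    return profiles


def identity_value(target_features: Iterable[str], profile: Profile) -> float:
    """Share of the profile's total feature value found in the target code."""
    target = set(target_features)
    total = sum(entry["value"] for entry in profile.values())
    if total == 0:
        return 0.0
    matched = sum(e["value"] for f, e in profile.items() if f in target)
    return matched / total


def needs_scrutiny(score: float, threshold: float = 0.5) -> bool:
    """Hypothetical predetermined condition: flag low-confidence authorship."""
    return score < threshold


training = {
    "alice": [["tabs", "snake_case"], ["tabs", "snake_case"]],
    "bob": [["spaces", "snake_case"], ["spaces", "camelCase"]],
}
profiles = build_profiles(training)
score = identity_value(["spaces", "camelCase"], profiles["alice"])
print(score, needs_scrutiny(score))
```

In this sketch, a target code set claimed to be authored by "alice" but exhibiting none of the features in her profile yields a low identity value and is flagged for further scrutiny. A practical embodiment would likely employ richer features and weighting, such as the cost path recited above, which the sketch does not attempt to model.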